A Best-Match Algorithm for Broad-Coverage Example-Based Disambiguation
Abstract
To improve the coverage of example-bases, two methods are introduced into the best-match algorithm. The first is for acquiring conjunctive relationships from corpora, as measures of word similarity that can be used in addition to thesauruses. The second, used when a word does not appear in an example-base or a thesaurus, is for inferring links to words in the example-base by comparing the usage of the word in the text and that of words in the example-base.

1 Introduction

Improvement of coverage in practical domains is one of the most important issues in the area of example-based systems. The example-based approach [6] has become a common technique for natural language processing applications such as machine translation and disambiguation (e.g. [5, 10]). However, few existing systems can cover a practical domain or handle a broad range of phenomena.

The most serious obstacle to robust example-based systems is the coverage of example-bases. It is an open question how many examples are required for disambiguating sentences in a specific domain. The Sentence Analyzer (SENA) was developed in order to resolve attachment, word-sense, and conjunctive ambiguities by using constraints and example-based preferences [11]. It lists about 57,000 disambiguated head-modifier relationships and about 300,000 synonyms and is-a binary relationships. Even so, lack of examples (no relevant examples) accounted for 46.1% of failures in an experiment with SENA [12].

Previously, it was believed to be easier to collect examples than to develop rules for resolving ambiguities. However, the coverage of each example is much more local than a rule, and therefore a huge number of examples is required in order to resolve realistic problems.
There has been some corpus-based research on how to acquire large-scale knowledge automatically in order to cover the domain to be disambiguated, but there are still major problems to be overcome. First, semantic knowledge such as word-sense cannot be extracted by automatic corpus-based knowledge acquisition. The example-base in SENA is developed by using a bootstrapping method. However, the results of word-sense disambiguation must be checked by a human, and word-senses are tagged to only about a half of all the examples, since the task is very time-consuming. A second difficulty in the example-based approach is the algorithm itself, namely, the best-match algorithm, which was used in earlier systems built around a thesaurus that consisted of a hierarchy of is-a or synonym relationships between words (word-senses).

This paper proposes two methods for improving the coverage of example-bases. The selected domain is that of sentences in computer manuals. First, knowledge that represents a type of similarity other than synonym or is-a relationships is acquired. As one measurement of the similarity, interchangeability between words can be used. In this paper, two types of the relationship reflect such interchangeability. First, the elements of coordinated structures are good clues to the interchangeability of words. Words can be extracted easily from a domain-specific corpus, and therefore the example-base can be adapted to the specific domain by using the domain-specific relationships.

If there are no examples and relations in the thesaurus, the example-base gives no information for disambiguation. However, the text to be disambiguated provides useful knowledge for this purpose [7, 3]. The relationships between words in the example-base and an unknown word can be guessed by comparing that word's usage in extracted examples and in the text.
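The two ideas above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the procedure used in SENA: coordinated elements are approximated by a hypothetical regular expression over single words joined by "and" or "or" (a real system would take them from parsed coordinated structures), usage is represented as hypothetical (modifier, relation, head) triples, and the function names `conjunctive_pairs`, `usage_contexts`, and `infer_link` are invented for this example.

```python
import re
from collections import Counter

# Assumption: coordination is approximated by "X and Y" / "X or Y" over single words.
COORD = re.compile(r"\b(\w+) (?:and|or) (\w+)\b")


def conjunctive_pairs(sentences):
    """Count unordered word pairs that occur as coordinated elements."""
    pairs = Counter()
    for sentence in sentences:
        for a, b in COORD.findall(sentence.lower()):
            if a != b:
                pairs[frozenset((a, b))] += 1
    return pairs


def usage_contexts(word, relations):
    """Collect the (relation, head) contexts in which a word is the modifier."""
    return {(rel, head) for (mod, rel, head) in relations if mod == word}


def infer_link(unknown, text_relations, example_relations):
    """Return the example-base word whose usage overlaps most with the
    unknown word's usage in the text, or None if nothing overlaps."""
    target = usage_contexts(unknown, text_relations)
    best, best_score = None, 0
    # Sorted so that ties are resolved deterministically (first word wins).
    for cand in sorted({mod for (mod, _, _) in example_relations}):
        score = len(target & usage_contexts(cand, example_relations))
        if score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    corpus = ["Press the keyboard or mouse button.",
              "Connect the keyboard and mouse."]
    print(conjunctive_pairs(corpus))

    example_base = [("disk", "obj", "format"), ("disk", "obj", "insert"),
                    ("manual", "obj", "read")]
    text = [("diskette", "obj", "format"), ("diskette", "obj", "insert")]
    print(infer_link("diskette", text, example_base))
```

In this toy setting, pair counts could serve as an additional similarity measure alongside a thesaurus, and the inferred link gives an unknown word a stand-in among words the example-base already covers.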
2 A Best-Match Algorithm

In this section, conventional algorithms for example-based disambiguation, and their associated problems, are briefly introduced. The algorithms of most example-based systems consist of three steps. (In some systems, the exact-match and the best-match are merged.)
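One way to picture exact matching followed by best matching over an is-a thesaurus is sketched below. This is a generic illustration, not the algorithm of any particular system: the thesaurus is assumed to be a tree stored as a hypothetical child-to-parent map, examples and candidates are assumed to be (modifier, head) pairs, similarity is assumed to be the number of is-a edges through the nearest common ancestor, and all function names are invented.

```python
def ancestors(word, parent):
    """Return the chain word -> ... -> root in an is-a map (assumed acyclic)."""
    chain = [word]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def distance(a, b, parent):
    """Number of is-a edges between a and b via their nearest common ancestor."""
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no common ancestor: the thesaurus gives no information


def best_match(candidates, examples, parent):
    """Pick the candidate (modifier, head) pair closest to a stored example."""
    best, best_dist = None, None
    for cand in candidates:
        if cand in examples:
            return cand, 0  # an exact match is preferred over any best match
        for ex_mod, ex_head in examples:
            if ex_head != cand[1]:
                continue  # only compare examples that share the head word
            d = distance(cand[0], ex_mod, parent)
            if d is not None and (best_dist is None or d < best_dist):
                best, best_dist = cand, d
    return best, best_dist


if __name__ == "__main__":
    parent = {"diskette": "storage", "disk": "storage",
              "storage": "device", "printer": "device"}
    examples = {("disk", "format"), ("printer", "install")}
    candidates = [("diskette", "format"), ("diskette", "install")]
    print(best_match(candidates, examples, parent))
```

When `distance` returns `None` for every comparison, `best_match` returns `(None, None)`, which corresponds to the coverage failure discussed in the introduction: neither the example-base nor the thesaurus says anything about the word.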
Similar resources
Getting Serious About Word Sense Disambiguation
Recent advances in large-scale, broad coverage part-of-speech tagging and syntactic parsing have been achieved in no small part due to the availability of large amounts of online, human-annotated corpora. In this paper, I argue that a large, human sense-tagged corpus is also critical as well as necessary to achieve broad coverage, high accuracy word sense disambiguation, where the sense distinct...
A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge
Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding semantics of natural languages, word sense disambiguation has been extensively studied in Computational Linguistics. However, existing methods either are brittle and narrowly focus on specific topics or words, or provide only mediocre performance in re...
A Comparison of Evaluation Metrics for a Broad-Coverage Stochastic Parser
This paper reports on the use of two distinct evaluation metrics for assessing a stochastic parsing model consisting of a broad-coverage Lexical-Functional Grammar (LFG), an efficient constraint-based parser and a stochastic disambiguation model. The first evaluation metric measures matches of predicate-argument relations in LFG f-structures (henceforth the LFG annotation scheme) to a gold stan...
Exemplar-Based Word Sense Disambiguation: Some Recent Improvements
In this paper, we report recent improvements to the exemplar-based learning approach for word sense disambiguation that have achieved higher disambiguation accuracy. By using a larger value of k, the number of nearest neighbors to use for determining the class of a test example, and through 10-fold cross validation to automatically determine the best k, we have obtained improved disambiguation ...
Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger
This paper presents a novel approach to broad-coverage word sense disambiguation and information extraction. The task consists of annotating text with the tagset defined by the 41 Wordnet supersense classes for nouns and verbs. Since the tagset is directly related to Wordnet synsets, the tagger returns partial word sense disambiguation. Furthermore, since the noun tags include the standard name...